Fault-Tolerant Shared Memory Simulations
Abstract
We consider the problem of simulating a PRAM on a faulty distributed memory machine (DMM). We focus on dynamic faults: each processor or memory module fails independently, with fixed probability, during the simulation of a PRAM step and remains faulty for the rest of the simulation. We build upon the randomized hashing-based simulations on non-faulty DMMs from [14], which achieve delay O(log log n) with high probability. We design and analyze routines for handling faults that occur during the simulation. Based on these routines, we present simulations on faulty DMMs with the same delay O(log log n) as in the non-faulty case, provided that the failure probability of processors and modules is small enough to guarantee that an expected linear number of processors and modules survive the simulation. Thus, resilience to memory or processor faults increases the delay of the simulation by at most a constant factor.
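The survival condition in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes a hypothetical per-step failure probability `p`, a hypothetical number of simulated PRAM steps `steps`, and `n` components, and uses only the stated fault model (independent failures, no recovery). Under those assumptions a component survives with probability (1 - p)^steps, so the expected number of survivors is n(1 - p)^steps, which stays linear in n when p is on the order of 1/steps.

```python
def survival_probability(p: float, steps: int) -> float:
    """Probability that one component is still alive after `steps` steps,
    assuming it fails independently with probability `p` in each step
    and never recovers (illustrative model, names are hypothetical)."""
    return (1.0 - p) ** steps


def expected_survivors(n: int, p: float, steps: int) -> float:
    """Expected number of surviving components out of `n`,
    by linearity of expectation."""
    return n * survival_probability(p, steps)


if __name__ == "__main__":
    n, steps = 1024, 1000  # assumed sizes, chosen only for illustration
    for p in (1e-2, 1e-3, 1e-4):
        survivors = expected_survivors(n, p, steps)
        print(f"p={p:g}: expected survivors = {survivors:.1f} of {n}")
```

With `p` equal to `1/steps`, the surviving fraction is close to 1/e, a constant, whereas a `p` ten times larger leaves almost no survivors.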
Similar resources
Using Peer Support to Reduce Fault-Tolerant Overhead in Distributed Shared Memories
We present a peer logging system for reducing performance overhead in fault-tolerant distributed shared memory systems. Our system provides fault-tolerant shared memory using individual checkpointing and rollback. Peer logging logs DSM modification messages to remote nodes instead of to local disks. We present results for implementations of our fault-tolerant technique using simulations of both...
Design and Analysis of a Dynamically Reconfigurable Shared Memory Cluster
In recent years, clusters have become a viable and less expensive alternative to multiprocessor systems. This paper proposes an architecture with a load balancing and a fault tolerant model for shared memory clusters. A task clustering algorithm, a centralized dynamic load balancing model, a load balancing algorithm and a fault tolerant model are proposed for shared memory clusters. The res...
Practical Schemes using Logs for Lightweight Recoverable DSM
In the existing Fault-Tolerant Software Distributed Shared Memory (FT-SDSM) with the message logging, the logs are used only to recover the failed nodes. In our previous work, we have implemented a lightweight logging protocol, called remote logging, on the SDSM for fault tolerance, which incurs low logging overhead with a fast network and a remote memory for back-up data. In this paper, we pro...
Fault Tolerance and Performance of Multipath Multistage Interconnection Networks
In building a multiprocessor system, we can minimize the system's mean time to failure by providing an architecture resilient to component faults. We compare the fault tolerance and performance characteristics of various fault-tolerant multistage interconnection networks. We primarily focus on networks composed of dilated routing components. A dilated router features redundant outputs in each l...
A Hierarchical Shared Memory Cluster Architecture with Load Balancing and Fault Tolerance
Recently, a great deal of attention has been paid to the design of hierarchical shared memory cluster systems. Cluster computing has made hierarchical computing systems increasingly common as target environments for large-scale scientific computations. This paper proposes a hierarchical shared memory cluster architecture with load balancing and fault tolerance. Hierarchies of shared memory and cache...